Simple RAG Demo

Overall Flow

  • This project is a small RAG system implementation: it processes a file from a local path and lets you ask a question about the document.
  • Here is what the output of the program looks like.
~/ » uv run main.py                    
INFO:__main__:Started parsing document
INFO:__main__:Finished parsing, found 23577 characters
INFO:__main__:Finished chunking, created 169 chunks
Batches: 100%|███████████████████████████████████████████████████████| 4/4 [00:00<00:00, 12.10it/s]
Batches: 100%|███████████████████████████████████████████████████████| 3/3 [00:00<00:00, 15.32it/s]
INFO:__main__:Processed and stored embeddings

Please enter a question to ask: what is retrival augumented generation?

INFO:__main__:Performing semantic search
Batches: 100%|███████████████████████████████████████████████████████| 1/1 [00:00<00:00, 8.39it/s]

Retrieval-Augmented Generation (RAG) is a framework that combines the knowledge of a generative language model with an external retriever to provide accurate and relevant responses to user queries. RAG works by providing the retriever and generator work together, where the retriever retrieves relevant information from a pre-defined corpus, while the generator generates new text based on the retrieved information. This process is repeated multiple times until the generated text matches the desired output. The final output of RAG is both accurate and relevant to the user's query, ensuring that the generated content is not only informative but also engaging and useful for the user.
def main() -> None:
    file_path = "Introduction to Retrieval Augmented Generation (RAG) By Weaviate.pdf"

    # Document parsing
    logger.info("Started parsing document")
    page_content = parse_document(path=file_path)
    logger.info(f"Finished parsing, found {len(page_content)} characters")

    # Text chunking
    chunks = chunk_text(page_content, 200, 60)
    logger.info(f"Finished chunking, created {len(chunks)} chunks")

    # Embedding and storage
    collection = get_embedding_collection()
    store_chunk_embeddings(chunks, collection)
    logger.info("Processed and stored embeddings")

    # Get user query
    query = prompt("Please enter a question to ask")
    if not query:
        raise ValueError("Query cannot be empty")

    # Semantic search
    logger.info("Performing semantic search")
    similar_results = semantic_search(query, collection)

    # Process results
    # info: We ignore results with a low score (confidence threshold).
    if not similar_results or similar_results[0][0] < 0.6:
        logger.warning("Not enough context found")
        print("Not enough context found, please try another question")
    else:
        content_results = [content for _, content in similar_results]
        final_answer = model_run(query, content_results)
        print(final_answer)


if __name__ == "__main__":
    main()
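
The confidence gate in main() can be exercised on its own with mock search results. This is a minimal sketch: passes_threshold is a hypothetical helper that mirrors the check above, and the sample scores and snippets are made up.

```python
def passes_threshold(similar_results, threshold=0.6):
    # Mirror the check in main(): there must be at least one result,
    # and the top result's score must clear the confidence threshold.
    return bool(similar_results) and similar_results[0][0] >= threshold

good = [[0.82, "RAG combines retrieval with generation."]]
weak = [[0.41, "Some unrelated passage."]]

print(passes_threshold(good))  # True
print(passes_threshold(weak))  # False
print(passes_threshold([]))    # False
```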

Extract Text from PDF

  • Using the pymupdf library, we can extract text page by page and store it in a list.
def parse_document(path: str) -> str:
    doc = pymupdf.open(path)

    page_content = []

    for page in doc:  # iterate over the document pages
        text = page.get_text()  # get plain text encoded as UTF-8
        page_content.append(text)

    return "".join(page_content)

Break the Text Contents

  • A custom chunking function: based on the options passed, it creates chunks that overlap by a sliding window.
def chunk_text(text, size, overlap):
    chunks = []

    for i in range(0, len(text), size - overlap):
        chunk = text[i : i + size]
        chunks.append(chunk)

    return chunks
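
To see the overlap window in action, the function can be run on a short string. The sizes here are illustrative, not the 200/60 used in main(); the function is repeated so the snippet is self-contained.

```python
def chunk_text(text, size, overlap):
    chunks = []
    # Advance by (size - overlap) characters each step, so consecutive
    # chunks share `overlap` characters of context.
    for i in range(0, len(text), size - overlap):
        chunks.append(text[i : i + size])
    return chunks

print(chunk_text("abcdefghij", 4, 2))
# ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```

Note the final chunk may be shorter than `size`, since slicing past the end of the string is safe in Python.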

Embedding

  • Before we store our embeddings, we first need to configure the store and the embedding model we want to use.
  • The collection we use is in-memory, so it is erased when the Python program exits.
def get_embedding_collection():
    embedding_model = llm.get_embedding_model("sentence-transformers/all-MiniLM-L6-v2")
    collection = llm.Collection(name="entries", model=embedding_model)
    return collection
  • We can now embed our chunks into the collection we created; llm has an option to store multiple chunks at once.
def store_chunk_embeddings(chunks, collection):
    collection.embed_multi(
        entries=((i, chunk) for i, chunk in enumerate(chunks)), store=True
    )
  • The llm library checks every row and calculates a cosine similarity score; we pass 3 because we only want the top 3 results.
def semantic_search(query, collection):
    similar_data = []
    for entry in collection.similar(query, 3):
        similar_data.append([entry.score, entry.content])
    return similar_data
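
Conceptually, the cosine similarity score used to rank entries is the dot product of two embedding vectors divided by the product of their magnitudes. This is a pure-Python sketch of the math, not the llm library's actual implementation:

```python
import math

def cosine_similarity(a, b):
    # Dot product over the product of the two vector norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0  (same direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0  (orthogonal)
```

A score of 1.0 means the vectors point in the same direction (semantically closest), while scores near 0 indicate unrelated content, which is why the 0.6 threshold in main() works as a confidence cutoff.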

Pass To LLM

  • We use the very small orca-mini:3b model (3B parameters, about 2 GB file size); the system prompt instructs the model to consider only the passed context.
def model_run(query, results):
    model = llm.get_model("orca-mini-3b-gguf2-q4_0")
    context = "\n".join(results)

    response = model.prompt(
        f"User query: {query} and the following Context: {context}",
        key="sk-...",
        system="You are an AI assistant that provides answers based solely on the given context and user query. Please ensure your responses are clear, concise, and directly address the user query, including only relevant information.",
    )
    return response.text()
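
The prompt handed to the model is just the user query concatenated with the newline-joined context, using the same format string as above. The query and context snippets in this sketch are made up:

```python
query = "what is retrieval augmented generation?"
results = [
    "RAG pairs a retriever with a generator.",
    "The retriever pulls relevant passages from a corpus.",
]

# Same assembly as model_run(): join the retrieved chunks, then
# interpolate the query and context into a single prompt string.
context = "\n".join(results)
prompt_text = f"User query: {query} and the following Context: {context}"
print(prompt_text)
```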